{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LeNet-5 Convolutional Neural Network\n",
    "====================\n",
    "\n",
    "In this tutorial I will show you how to implement the LeNet-5 CNN with tensorflow. LeNet-5, a pioneering 7-level convolutional network that classifies digits, was applied by several banks to recognise hand-written numbers on checks (cheques) digitized in 28x28 pixel images. This tutorial is based on the article *\"Gradient-based learning applied to document recognition\"* of LeCun et al. (1998) that I suggest you to read.\n",
    "\n",
    "This tutorial requires the **MNIST dataset**. The dataset can be downloaded and prepared following [this notebook](./../mnist/mnist.ipynb). Here we need the train and test files (in TFRecord format) that have been produced at the end of the notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Overview of LeNet-5 architecture\n",
    "----------------------------\n",
    "\n",
    "In this section we will give a look to the structure of LeNet-5. This will give you a wider perspective on the tensorflow implementation. LeNet-5 comprises 7 layers (not counting the input) that are identified with the letters: C=convolutional layer, S=subsampling layer, F=fully connected layer. It is important to notice that the original LeNet-5 has been trained on a full set of characters that included letters, digits and symbols. Here we are going to use only digits.\n",
    "\n",
    "<p align=\"center\">\n",
    "<img src=\"../etc/img/lenet5_architecture.png\" width=\"750\">\n",
    "</p>\n",
    "\n",
    "**Input**: the LeNet-5 has a 32x32 input matrix. However, the MNIST dataset has images of size 28x28. In the paper this is explained as a way to center distinctive features (e.g. stroke end-points, corners) in the receptive field of the highest-level feature detectors.\n",
    "\n",
    "**Layer C1**: is a convolutional layer with 6 feauture maps of size 28x28, and each unit in the feauture map is connected to a 5x5 neighbrhood in the input. There are 156 trainable parameters.\n",
    "\n",
    "**Layer S2**: is a sub-sampling layer with 6 feature maps of size 14x14, and each unit of the feature map is connected to a 2x2 neighborood in C1. The result of the sub-sampling is passed to a sigmoidal function. This layer has 12 trainable parameters.\n",
    "\n",
    "**Layer C3**: is another convolutional layer with 16 feauture maps. Each unit is connected to a 5x5 neighborood in S2. \n",
    "\n",
    "**Layer S4**: is a sub-sampling layer with 16 feauture maps of size 5x5. This layer has 32 trainable parameters.\n",
    "\n",
    "**Layer C5**: is a convolutional layer with 120 feauture maps connected to a 5x5 neighborood on all 16 feauture maps of S4. Since also the size of S4 is 5x5, the size of the feauture maps of C5 is 1x1 and it leads to a total of 48120 trainable parameters. This layer should be classified as fully-connected, however in the original paper they classified it as a convolutional layer because if the input features were made bigger (with everything else kept constant) the future maps dimension would be larget than 1x1.\n",
    "\n",
    "**Layer F6**: is a fully-connected layer that computes a dot product between the input vector coming from C5 and the internal weights. The number of trainable parameters is 10164. The activation function is an hyperbolic tangent.\n",
    "\n",
    "**Output**: is composed of Euclidean Radial Basis Function units (RBF) one for each class, with 84 inputs each. The output of each unit is computed as follows:\n",
    "\n",
    "$$y_{i} = \\sum_j{(x_{j} - w_{ij})^{2}}$$\n",
    "\n",
    "The equation represents the Euclidean distance between the input vector (the output of F6) and the weights. The further away is the input vector from the weight vector, the larger is the RBF output. In the original paper the components of the parameters vectors were manually chosen and set to -1 or +1. An alternative is to chose the those parameters at random with equal probabilities for -1 and +1. The representation of the output is distributed, meaning that characters that are similar (e.g. zero and uppercase O, one and uppercase I) will have similar output codes. It is important to rememeber that in the original implementation the network has been trained on a full set of characters (letters, symbols, digits), whereas here we are going to train the network only on digits. \n",
    "\n",
    "**Output (alternative version)**: in order to avoid the hand-tuning of the RBFs in the output we are going to use a common fully connected layer. Moreover since we only have 10 possible classes, we use a place-code representation (aka grand-mother cell code) where each digit is coded as a one-hot vector. The raw output of the network is passed through a softmax function and the resulting vector is compared with the one-hot target through a cross entropy error function.\n"
   ]
  },
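  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The softmax and cross-entropy steps of the alternative output can be illustrated with a minimal sketch in plain Python. The helper names `softmax` and `cross_entropy` and the example logits are hypothetical, chosen only for illustration; in the model itself this job is done by `tf.losses.softmax_cross_entropy`.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def softmax(logits):\n",
    "    # Subtract the maximum logit for numerical stability (the result is unchanged).\n",
    "    shift = max(logits)\n",
    "    exps = [math.exp(z - shift) for z in logits]\n",
    "    total = sum(exps)\n",
    "    return [e / total for e in exps]\n",
    "\n",
    "\n",
    "def cross_entropy(one_hot_target, probabilities):\n",
    "    # Only the entry of the true class contributes, because the target is one-hot.\n",
    "    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))\n",
    "\n",
    "\n",
    "logits = [2.0, 1.0, 0.0]  # hypothetical raw scores for a 3-class problem\n",
    "target = [1, 0, 0]        # one-hot code of the true class\n",
    "probs = softmax(logits)\n",
    "loss = cross_entropy(target, probs)\n",
    "print(probs)\n",
    "print(loss)\n",
    "```\n",
    "\n",
    "The probabilities sum to one, and the loss shrinks as the probability assigned to the true class approaches one."
   ]
  },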
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Implementating the model in Tensorflow\n",
    "-------------------------------------------\n",
    "\n",
    "Here I will use the recently introduced **Estimators**, a high-level tensorflow API that greatly simplifies machine learning programming. Estimators encapsulate the following actions: training, evaluation, prediction, export for serving. Using estimators you can develop a state of the art model with high-level intuitive code. In short, it is generally much easier to create models with Estimators than with the low-level TensorFlow APIs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training phase can be embedded into a single function defined as `model_fn`. The idea is to write our own function containing the model and pass it at training time. In the same function we also have to add specific operations to do in the three modalities: predict, train, eval.\n",
    "\n",
    "- When `model_fn` is called with `mode == ModeKeys.PREDICT`, the model function must return a `tf.estimator.EstimatorSpec` containing the following information: the mode, and the prediction. The model must have been trained prior to making a prediction. The trained model is stored on disk in the directory established when you instantiated the Estimator. \n",
    "\n",
    "- When `model_fn` is called with `mode == ModeKeys.TRAIN`, the model function must train the model.\n",
    "\n",
    "- When `model_fn` is called with `mode == ModeKeys.EVAL`, the model function must evaluate the model, returning loss and possibly one or more metrics.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def my_model_fn(features, labels, mode):\n",
    "    #Defining the CNN model\n",
    "    input_layer = tf.reshape(features, [-1, 32, 32, 1])\n",
    "    c1 = tf.layers.conv2d(inputs=input_layer, filters=6, kernel_size=[5, 5], \n",
    "                          padding=\"valid\", activation=tf.nn.relu)\n",
    "    s2 = tf.layers.max_pooling2d(inputs=c1, pool_size=[2, 2], strides=2, padding=\"valid\")\n",
    "    c3 = tf.layers.conv2d(inputs=s2, filters=16, kernel_size=[5, 5], \n",
    "                          padding=\"valid\", activation=tf.nn.relu) \n",
    "    s4 = tf.layers.max_pooling2d(inputs=c3, pool_size=[2, 2], strides=2)\n",
    "    s4_flat = tf.reshape(s4, [-1, 5 * 5 * 16])   \n",
    "    c5 = tf.layers.dense(inputs=s4_flat, units=120, activation=tf.nn.relu)   \n",
    "    f6 = tf.layers.dense(inputs=c5, units=84)\n",
    "    logits = tf.layers.dense(inputs=f6, units=10)\n",
    "    #PREDICT mode\n",
    "    if mode == tf.estimator.ModeKeys.PREDICT:\n",
    "        predictions = {\"classes\": tf.argmax(input=logits, axis=1),\n",
    "                       \"probabilities\": tf.nn.softmax(logits, name=\"softmax_tensor\"),\n",
    "                       \"logits\": logits}\n",
    "        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)\n",
    "    #TRAIN mode\n",
    "    elif mode == tf.estimator.ModeKeys.TRAIN:\n",
    "        loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)\n",
    "        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)\n",
    "        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())\n",
    "        accuracy = tf.metrics.accuracy(labels=tf.argmax(labels, axis=1), predictions=tf.argmax(logits, axis=1))\n",
    "        tf.summary.scalar('accuracy', accuracy[1]) #<-- accuracy[1] to grab the value\n",
    "        tf.summary.image(\"input_features\", tf.reshape(features, [-1, 32, 32, 1]), max_outputs=3)\n",
    "        tf.summary.image(\"c1_k1_feature_maps\", tf.reshape(c1[:, :, :, 0], [-1, 28, 28, 1]), max_outputs=3) #c1 kernel-1\n",
    "        tf.summary.image(\"c3_k2_feature_maps\", tf.reshape(c3[:, :, :, 0], [-1, 10, 10, 1]), max_outputs=3) #c3 kernel-1\n",
    "        logging_hook = tf.train.LoggingTensorHook({\"accuracy\" : accuracy[1]}, every_n_iter=200)\n",
    "        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op, training_hooks =[logging_hook])\n",
    "    #EVAL mode\n",
    "    elif mode == tf.estimator.ModeKeys.EVAL:\n",
    "        loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)\n",
    "        accuracy = tf.metrics.accuracy(labels=tf.argmax(labels, axis=1), predictions=tf.argmax(logits, axis=1))\n",
    "        eval_metric = {\"accuracy\": accuracy}\n",
    "        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can create the etimator object and link our model function to it. We also have to specify a folder where the log will be stored."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "lenet5 = tf.estimator.Estimator(model_fn=my_model_fn, model_dir=\"/tmp/tf_model\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Training the model\n",
    "---------------------\n",
    "\n",
    "Before starting the training on the model we need a dataset. Here we are going to use the MNIST digit dataset. The dataset can be downloaded and prepared following [this notebook](./../mnist/mnist.ipynb) available on the TensorBag repository. Here we need the train and test files (in TFRecord format) that have been produced at the end of the notebook. We are going to use the tensorflow `Dataset` class to manage the features and labels.\n",
    "\n",
    "We create an input pipeline called `my_input_fn()` that loads the dataset from a TFRecord file, preprocesses it, and then associate it to an iterator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def my_input_fn():\n",
    "    def _parse_function(example_proto):\n",
    "        features = {\"image\": tf.FixedLenFeature((), tf.string, default_value=\"\"),\n",
    "                    \"label\": tf.FixedLenFeature((), tf.int64, default_value=0)}\n",
    "        parsed_features = tf.parse_single_example(example_proto, features)\n",
    "        image_decoded = tf.decode_raw(parsed_features[\"image\"], tf.uint8) #char -> uint8\n",
    "        image_reshaped = tf.reshape(image_decoded, [28, 28])\n",
    "        padding = tf.constant([[2, 2,], [2, 2]])\n",
    "        image_padded = tf.pad(image_reshaped, padding, \"CONSTANT\") #padding to 32x32\n",
    "        image = tf.cast(image_padded, tf.float32)\n",
    "        label_one_hot = tf.one_hot(parsed_features[\"label\"], depth=10, dtype=tf.int32)\n",
    "        #label = tf.reshape(label_one_hot, [1, 10])\n",
    "        return image, label_one_hot\n",
    "\n",
    "    tf_train_dataset = tf.data.TFRecordDataset(\"./mnist_train.tfrecord\")\n",
    "    tf_train_dataset = tf_train_dataset.map(_parse_function)\n",
    "    tf_train_dataset.cache() # caches entire dataset\n",
    "    #Setting a buffer_size greater than the number of examples in the Dataset \n",
    "    #ensures that the data is completely shuffled. \n",
    "    tf_train_dataset = tf_train_dataset.shuffle(buffer_size = 60000 * 2) # shuffle all the elements\n",
    "    #The repeat method has the Dataset restart when it reaches the end.\n",
    "    tf_train_dataset = tf_train_dataset.repeat(11) # repeats dataset this times\n",
    "    #The batch method collects a number of examples and stacks them, to create batches. \n",
    "    #This adds a dimension to their shape. The new dimension is added as the first dimension.\n",
    "    #The batch may have unknown batch size because the last batch can have fewer elements.\n",
    "    tf_train_dataset = tf_train_dataset.batch(32) # batch size\n",
    "    print \"Train dataset: \" + str(tf_train_dataset)\n",
    "    \n",
    "    iterator = tf_train_dataset.make_one_shot_iterator()\n",
    "    batch_features, batch_labels = iterator.get_next()\n",
    "    return batch_features, batch_labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have the datasets objects and we are ready to train the estimator. Before starting the training we define a **verbosity** level. There are different type of messages that can be printed: FATAL, ERROR, WARN, INFO, DEBUG."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "tf.logging.set_verbosity(tf.logging.INFO)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "lenet5.train(input_fn=my_input_fn, steps=2000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember that you can follow the training on **tensorboard** running this command from terminal `tensorboard --logdir=/tmp/tf_model`. There are some useful information we may want to see.\n",
    "\n",
    "- *global_step/sec*: a performance indicator showing how many batches (gradient updates) we processed per second as the model trains.\n",
    "\n",
    "- *loss*: the reported loss\n",
    "\n",
    "- *accuracy (evaluation)*: use `eval_metric_ops={'my_accuracy': accuracy})`\n",
    "\n",
    "- *accuracy (training)*: use `tf.summary.scalar('accuracy', accuracy[1])`\n",
    "\n",
    "- *images*: we can check input images using `tf.summary.image()`\n",
    "\n",
    "The estimator will automatically save **checkpoints** at training time. Running the training again will resume the model from the latest checkpoint.\n"
   ]
  },
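  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How far `steps=2000` goes through the data can be worked out by hand. This is a rough calculation, assuming the training file holds the standard 60000 MNIST training images (the same figure used for the shuffle buffer) and the batch size of 32 set in `my_input_fn()`:\n",
    "\n",
    "$$\\frac{60000}{32} = 1875 \\text{ steps per epoch}, \\qquad \\frac{2000}{1875} \\approx 1.07 \\text{ epochs}$$\n",
    "\n",
    "Since `repeat(11)` provides $11 \\cdot 1875 = 20625$ batches, a call with `steps=2000` stops before the input is exhausted. Each further call to `train()` resumes from the latest checkpoint and runs another 2000 steps."
   ]
  },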
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Evaluating the model on the test set\n",
    "-------------------------------------\n",
    "\n",
    "We are ready to evaluate the model on the tet set. First of all we have to load the test set and then we can use the `evaluate()` method.\n",
    "\n",
    "The method calls for each step `input_fn()`, which returns one batch of data. Evaluates until all the steps batches are processed, or `input_fn()` raises an end-of-input exception (`OutOfRangeError` or `StopIteration`). We set the parameters `steps=None` in the `evaluate()` method. If `None`, evaluates until `input_fn()` raises an end-of-input exception.\n",
    "\n",
    "The method **returns** a dict containing the evaluation metrics specified in `model_fn` keyed by name, as well as an entry `global_step` which contains the value of the global step for which this evaluation was performed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def my_eval_input_fn():\n",
    "    def _parse_function(example_proto):\n",
    "        features = {\"image\": tf.FixedLenFeature((), tf.string, default_value=\"\"),\n",
    "                    \"label\": tf.FixedLenFeature((), tf.int64, default_value=0)}\n",
    "        parsed_features = tf.parse_single_example(example_proto, features)\n",
    "        image_decoded = tf.decode_raw(parsed_features[\"image\"], tf.uint8) #char -> uint8\n",
    "        image_reshaped = tf.reshape(image_decoded, [28, 28])\n",
    "        padding = tf.constant([[2, 2,], [2, 2]])\n",
    "        image_padded = tf.pad(image_reshaped, padding, \"CONSTANT\") #padding to 32x32\n",
    "        image = tf.cast(image_padded, tf.float32)\n",
    "        label_one_hot = tf.one_hot(parsed_features[\"label\"], depth=10, dtype=tf.int32)\n",
    "        return image, label_one_hot\n",
    "\n",
    "    tf_test_dataset = tf.data.TFRecordDataset(\"./mnist_test.tfrecord\")\n",
    "    tf_test_dataset = tf_test_dataset.map(_parse_function)\n",
    "    tf_test_dataset.cache() # caches entire dataset\n",
    "    tf_test_dataset = tf_test_dataset.repeat(1) # repeats dataset this times\n",
    "    tf_test_dataset = tf_test_dataset.batch(1) # batch size\n",
    "    print \"Test dataset: \" + str(tf_test_dataset)   \n",
    "    \n",
    "    iterator_test = tf_test_dataset.make_one_shot_iterator()\n",
    "    batch_features, batch_labels = iterator_test.get_next()\n",
    "    return batch_features, batch_labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "lenet5.evaluate(input_fn=my_eval_input_fn, steps=10000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Predicting the class of new samples\n",
    "------------------------------------------\n",
    "\n",
    "If we want to deploy our model in an application we can use the method `predict()` in order to get the  probabilities of an input image. Here we take 10 images contained in the `./etc` folder and we use the model to predict their value. As usual we load the images in a dataset and then we use it as input to the `predict()` method. You can try to corrupt the images using an editor (e.g. Gimp, Inkscape, or Photoshop) and verify when the corruption affects the classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def my_predict_input_fn():\n",
    "    def _parse_function(filename):\n",
    "        image_string = tf.read_file(filename)\n",
    "        image_decoded = tf.image.decode_image(image_string)\n",
    "        image_resized = tf.reshape(image_decoded, [32, 32]) #tf.image.resize_images(image_decoded, [32, 32])\n",
    "        image = tf.cast(image_resized, tf.float32)\n",
    "        return image\n",
    "\n",
    "    # A vector of filenames\n",
    "    filenames = tf.constant([\"./etc/sample_0.png\", \"./etc/sample_1.png\", \"./etc/sample_2.png\", \n",
    "                             \"./etc/sample_3.png\", \"./etc/sample_4.png\", \"./etc/sample_5.png\", \n",
    "                             \"./etc/sample_6.png\", \"./etc/sample_7.png\", \"./etc/sample_8.png\", \n",
    "                             \"./etc/sample_9.png\"])\n",
    "\n",
    "    tf_predict_dataset = tf.data.Dataset.from_tensor_slices((filenames))\n",
    "    tf_predict_dataset = tf_predict_dataset.map(_parse_function)\n",
    "    tf_predict_dataset = tf_predict_dataset.repeat(1) # repeats dataset this times\n",
    "    tf_predict_dataset = tf_predict_dataset.batch(10) # batch size\n",
    "    \n",
    "    iterator_predict = tf_predict_dataset.make_one_shot_iterator()\n",
    "    batch_features = iterator_predict.get_next()\n",
    "    return batch_features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "predictions = lenet5.predict(input_fn=my_predict_input_fn)\n",
    "\n",
    "for i, prediction in enumerate(predictions):\n",
    "    print \"True class: \" + str(i)\n",
    "    print \"Predicted class: \" + str(prediction['classes'])\n",
    "    print \"Probabilities: \" + str(prediction['probabilities'])\n",
    "    print \"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Conclusions\n",
    "-------------\n",
    "\n",
    "In this tutorial we saw how the LeNet-5 CNN works and how it is possible to train it on a simple dataset. You can further improve the performance of the network using **dropout** and **adaprive gradient methods**.\n",
    "\n",
    "**Copyright (c)** 2018 Massimiliano Patacchiola, MIT License"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
